Week 1: Important Statistical Concepts

PSYC 3032M

Udi Alter

January 2024

Back to Basics: Review of Important Statistical Concepts

Descriptive Statistics

Here are the miles-per-gallon observations (mpg) from the mtcars data. What can you say about it?

mtcars$mpg
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4

Not a lot, right? Are miles-per-gallon similar across observations? How similar? If I pick some observation at random, what mpg value should I expect? As humans, simply looking at rows and columns of numbers doesn’t give us much insight.

Descriptive statistics are summaries of the data that humans can easily comprehend. To describe data, we often rely on measures of central tendency (e.g., mean, median), measures of variability (e.g., standard deviation, variance), and, crucially, visualizations (e.g., histograms, box plots, density plots). When we use descriptive statistics to summarize the data and illustrate its features, we can actually understand the data, making it easier to spot patterns, detect issues, and uncover potential insights.

# loading packages
library(misty)  # for descriptive statistics
library(tidyverse) # for data handling and visualizations, i.e., ggplot2, dplyr, etc. 
library(plotly) # for dynamic and interactive plots!!!

misty::descript(mtcars$mpg)
##  Descriptive Statistics
## 
##    n nNA   pNA     M   SD   Min   Max Skew  Kurt
##   32   0 0.00% 20.09 6.03 10.40 33.90 0.67 -0.02
summary(mtcars$mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.43   19.20   20.09   22.80   33.90
p1 <- ggplot2::ggplot(mtcars, aes(x = mpg)) + 
  geom_histogram(binwidth = 3.85, fill = "royalblue", color = "black", linewidth = 1, alpha = 0.8) +
  theme_minimal()

plotly::ggplotly(p1)

What can you tell me about the mpg data now?

Measures of Central Tendency

Mode: The most frequently occurring observation. Best used with categorical or ordinal data, for example, type of car, yes/no responses, Likert-type scales (e.g., from 1 = strongly disagree to 5 = strongly agree), etc.

Median: Middle score in a distribution (i.e., 50% of scores fall above it, 50% fall below). For an even number of scores, take the average of the two middle scores. Best used with data that show a non-symmetric (skewed) distribution (e.g., mpg).

Mean: Average of the scores \[\bar{x} = \frac{1}{n} \cdot \sum_{i=1}^{n}x_i \]

This particular statistic has special meaning in statistical theory because it relates to the expectation of a variable, \(E(X)\): the (weighted) average of a random variable \(X\) approaches \(E(X)\) as the number of observations grows large.
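As a concrete sketch, all three measures of central tendency can be computed in base R. Note that base R has no built-in function that returns the mode of a variable, so the `get_mode()` helper below is our own illustration, not a standard function:

```r
x <- mtcars$mpg

mean(x)    # arithmetic mean
median(x)  # middle score

# Base R has no "mode of a variable" function, so we tabulate the
# values and return the most frequent one(s)
get_mode <- function(v) {
  tab <- table(v)
  as.numeric(names(tab)[tab == max(tab)])
}
get_mode(x)  # for mpg, several values tie at the top frequency
```

The fact that `get_mode()` returns several tied values here is itself a reminder that the mode is rarely informative for continuous data like mpg.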

Measures of Variability

Range: highest \(x\) minus lowest \(x\), or \(max(x) - min(x)\). Range is a pretty crude measure of variability in that it relies exclusively on the most extreme observations and ignores the rest of the data. Thus, it can be misleading at times.

Interquartile Range (IQR): The range of the middle 50% of the observations (i.e., ignoring the most extreme 25% of the observations from each tail).

Variance (VAR): Noted \(\sigma^2\) for population variance, and \(s^2\) for sample variance. Variance is (roughly) the average of the squared deviations from the mean,

\[VAR(x)= s^2 = \hat{\sigma}^2 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2 }{n-1} = \frac{SS}{n-1}\]

where \(SS\) is referred to as the sum of squares. We divide by \(n − 1\) instead of \(n\) to adjust for the fact that we must use an estimate of the population mean, \(\mu\) (namely \(\bar{x}\)); deviations from \(\bar{x}\) are slightly smaller than deviations from \(\mu\), so dividing by \(n − 1\) corrects for this downward bias.

Like the mean, variance is important for statistical theory because its behaviour can often be studied using analytic sampling distributions. Although theoretically very important in statistical inference, as we’ll encounter many times in this course, the variance generally has some interpretation problems because it is in the “squared” metric of \(X\). To fix this, we can use the standard deviation.

Standard Deviation (SD): Noted \(\sigma\) for the population SD, and \(s\) for the sample SD. The SD is the square root of the variance. Its advantage over the variance is that it is in the same units as the original variable \(X\), because taking the square root undoes the squaring of the deviations from the mean.
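Each of these measures corresponds directly to a base R function, and the variance formula above can be reproduced by hand as a check (a quick sketch using the same mpg data):

```r
x <- mtcars$mpg

max(x) - min(x)   # range
IQR(x)            # interquartile range (middle 50%)
var(x)            # sample variance (divides SS by n - 1)
sd(x)             # standard deviation = sqrt(var(x))

# The same variance "by hand", matching the formula above:
ss <- sum((x - mean(x))^2)  # sum of squares, SS
ss / (length(x) - 1)        # equals var(x)
```

Note that `var()` and `sd()` in R always use the \(n - 1\) (sample) denominator, which is exactly the adjustment described above.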

Measures of Shape

Things get trickier when we start talking about the general shape of a distribution. Is a variable symmetric or skewed? Unimodal or multimodal? Does it have thick or thin tails? There are some helpful statistics in this area, which are generally defined in reference to the normal (also called Gaussian) distribution.

For example, skewness estimates start at a value of 0, indicating symmetry (like the normal distribution), and can be positive or negative to indicate positive or negative skewness.

Here’s a nice mnemonic to help you remember which of the distributions above is positively skewed and which is negatively skewed. Look at the yellow-ish distribution above (the positively skewed one), the one with its long tail on the right. Now, imagine this distribution rotating 90 degrees clockwise; which letter does this remind you of? P, for Positive!

P for positively skewed

Kurtosis estimates have a similar property: 0 corresponds to the peakedness of the normal distribution, negative values indicate a flatter shape with “thin” tails, and positive values a more peaked shape with “thick” or “heavy” tails. At the end of the day, it generally makes more sense just to plot the data to get a feel for the overall shape.

Parameters

Parameters are fixed numerical values that describe specific characteristics of an entire population, such as the true mean, variance, or proportion. Typically, parameters are what we wish to know or uncover about the population of interest (e.g., all university students or all researchers in North America). Unlike statistics, which are derived from sample data and serve as estimates, parameters represent the actual values for the population, though they are often unknown and must be inferred using—yes, you guessed it—inferential statistics. In research, parameters are the key quantities we aim to estimate, giving us a clearer understanding of the population as a whole.

A parameter is a single, objective, and fixed value, though typically unknown to us. When running an experiment, we calculate a statistic from the sample as a proxy (i.e., estimate) for the unknown population parameter. The population parameter is singular because there is only one true value it can take. It is objective because it represents an unequivocal truth, even if we cannot know its exact value. Finally, the population parameter is fixed—it remains constant at the time of observation. As you’ll soon see, Bayesian philosophy offers quite a different perspective on probability and parameters.

By convention, these are typically expressed in Greek letters, for example:

  • \(\mu\) = (population) mean
  • \(\sigma\) = standard deviation
  • \(\sigma^2\) = variance
  • \(\rho\) = correlation coefficient
  • \(\beta\) = regression coefficient

But, clearly, there aren’t enough letters in the Greek alphabet to describe all possible parameters (e.g., we could also speak of a population median).

Statistics and Estimates

Statistics are the numerical summaries we calculate from a sample of data to help us make sense of it. Think of statistics as the little clues or snapshots we gather from a subset of the population, like the sample mean or variance, which give us a peek into what the whole population might look like. Unlike parameters, which describe the entire population, statistics are our best guesses based on the data at hand. They’re flexible, subject to change with different samples, and are the main tools we use to estimate the unknown truths hidden in the population.

Estimates are the values we calculate from sample data that serve as stand-ins for the unknown population parameters. Since we can’t usually measure the entire population, we rely on estimates—like a sample mean or proportion—and a few assumptions—e.g., that the observations are independent of one another and were randomly selected from the population—to make informed and educated guesses about the true population values. These estimates aren’t perfect, but with the right sample and methods, they get us pretty close. They help bridge the gap between the data we have and the broader, often elusive, population parameters we’re trying to figure out.

Often, estimates are given a special “hat” to indicate that they are estimates, for example:

  • \(\hat{\mu}\) = mean estimate
  • \(\hat{\sigma}\) = estimated standard deviation
  • \(\hat{\sigma}^2\) = estimated variance
  • \(\hat{\rho}\) = estimated correlation coefficient
  • \(\hat{\beta}\) = estimated regression coefficient

That said, many researchers and educators prefer to use different notations altogether, for example:

  • \(M\) = mean estimate
  • \(s\) = estimated standard deviation
  • \(s^2\) = estimated variance
  • \(r\) = estimated correlation coefficient
  • \(b\) = estimated regression coefficient

Relationship Between Statistics and Estimates

Statistics and estimates are closely connected, like two sides of the same coin. When we calculate a statistic—such as the mean, variance, or proportion—from a sample, it becomes an estimate of the corresponding population parameter. In other words, the statistic is the number we compute from our sample data, and we use that number as our best guess (i.e., estimate) of the unknown value we’re interested in that describes the entire population.

So, while the statistic is simply the number we calculate from the sample, the estimate is the role it plays in standing in for the unknown population parameter. Together, they form a bridge between the sample we have and the broader population we’re trying to understand. The accuracy and precision of an estimate depend on the quality of the statistic and how well our sample represents the population.
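We can make this relationship concrete with a small simulation sketch. Here we “play population”: the parameter values \(\mu = 100\) and \(\sigma = 15\) (an IQ-like scale) are made up for illustration, and because we chose them ourselves, we can watch the sample statistics cluster around the fixed, true parameter:

```r
set.seed(3032)  # arbitrary seed for reproducibility

# A hypothetical population with known, fixed parameters
mu    <- 100
sigma <- 15

# Draw 1000 samples of n = 50 each, computing the sample mean each time;
# every sample mean is a statistic serving as an estimate of mu
m_hats <- replicate(1000, mean(rnorm(50, mean = mu, sd = sigma)))

mean(m_hats)  # the estimates cluster around the fixed parameter mu
sd(m_hats)    # ...with sampling variability close to sigma / sqrt(50)
```

Each individual `m_hat` differs from sample to sample (statistics are flexible), yet all of them hover around the one true value of \(\mu\) (the parameter is singular and fixed).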